# Summary of the National A Paper
The National A Paper is used in five provinces: Sichuan, Inner Mongolia, Shaanxi, Qinghai, and Ningxia. This evaluation used the complete set of questions from every subject of the National A Paper, except Politics, whose questions have not yet been made public. The questions can be viewed at [Gaokao Express](https://easylearn.baidu.com/gaokao/content/list?tabKey=question).

# Evaluation
During the evaluation, the models' responses were randomly labeled A, B, C, D, E, F, and G before being handed to teachers for scoring. Scoring followed these criteria:
- For Chinese, Mathematics, and English, images were discarded and the models answered from text alone (consistent with the New Curriculum Paper).
- For single-choice and fill-in-the-blank questions in every subject, points were awarded only if the answer exactly matched the reference answer.
- In Mathematics, multiple-choice questions were scored in proportion to the number of correct options selected; selecting any incorrect option earned no points.
- Subjective questions received partial credit according to the correctness of the steps shown.
- Essay questions were scored according to established essay scoring criteria.
- Questions containing images in the comprehensive subjects were answered by the multimodal model from the same series. For models such as Mixtral, which only have a text version, the scores come from text-only inference. For Qwen2, the only multimodal version released is QwenVL-7B, so the results of the Qwen multimodal model may not reflect the actual capability of the series.
- Because QwenVL-7B performed poorly, we also evaluated the Qwen2-72B text model on the image-based questions in Physics, Chemistry, and Geography of the National A Paper, to better reflect the true level of the Qwen series.

In addition, to make the results reproducible, all answers except essays were generated by each model with greedy decoding.
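
The two objective scoring rules above can be sketched in a few lines of Python. This is only an illustration, not the grading code used by the teachers: the function names are hypothetical, and the whitespace and case normalization in the exact-match rule is an assumption that the criteria do not spell out.

```python
def score_exact_match(answer: str, reference: str, full_marks: float) -> float:
    """All-or-nothing rule for single-choice and fill-in-the-blank questions.

    Assumption: answers are compared after trimming whitespace and ignoring case.
    """
    same = answer.strip().upper() == reference.strip().upper()
    return full_marks if same else 0.0


def score_multi_select(selected: set, correct: set, full_marks: float) -> float:
    """Proportional rule for Mathematics multiple-choice questions.

    Any incorrect option (or an empty selection) earns zero; otherwise the
    credit is the share of correct options that were selected.
    """
    if not selected or selected - correct:
        return 0.0
    return full_marks * len(selected & correct) / len(correct)


# Two of three correct options selected on a 6-point question earns 4 points.
partial = score_multi_select({"A", "C"}, {"A", "C", "D"}, 6)
# Adding one wrong option drops the score to zero.
spoiled = score_multi_select({"A", "B"}, {"A", "C", "D"}, 6)
```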

## Overall Scores
The total scores of the models participating in the examination are displayed as follows:

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: center;">
      <th colspan="13"  style="text-align: center;">Scores on the National A Paper (sorted by total science score)</th>
    </tr>
  </thead>
  <tbody>
    <tr style="text-align: center;">
      <td>Model</td>
      <td>Research Institution</td>
      <td>Chinese</td>
      <td>English</td>
      <td>Mathematics (Science)</td>
      <td>Physics</td>
      <td>Chemistry</td>
      <td>Biology</td>
      <td>Mathematics (Arts)</td>
      <td>History</td>
      <td>Geography</td>
      <td>Total Science Score</td>
      <td>Total Humanities Score (Excluding Politics)</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B text only</td>
      <td>Alibaba</td>
      <td>128</td>
      <td>141</td>
      <td>89</td>
      <td>32</td>
      <td>48</td>
      <td>50</td>
      <td>95</td>
      <td>71</td>
      <td>81</td>
      <td>488</td>
      <td>516</td>
    </tr>
    <tr style="text-align: center;">
      <td>GPT-4o</td>
      <td>OpenAI</td>
      <td>122</td>
      <td>142.5</td>
      <td>84</td>
      <td>31</td>
      <td>34</td>
      <td>72</td>
      <td>89</td>
      <td>82</td>
      <td>66</td>
      <td>485.5</td>
      <td>501.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>WQX+VL-20B</td>
      <td>Ours</td>
      <td>111</td>
      <td>141</td>
      <td>78</td>
      <td>30</td>
      <td>52</td>
      <td>50</td>
      <td>71</td>
      <td>76</td>
      <td>64</td>
      <td>462</td>
      <td>463</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-72B+VL-7B</td>
      <td>Alibaba</td>
      <td>128</td>
      <td>141</td>
      <td>89</td>
      <td>22</td>
      <td>22</td>
      <td>50</td>
      <td>95</td>
      <td>71</td>
      <td>34</td>
      <td>452</td>
      <td>469</td>
    </tr>
    <tr style="text-align: center;">
      <td>Mixtral 8x22B</td>
      <td>Mistral</td>
      <td>92</td>
      <td>142</td>
      <td>58</td>
      <td>38</td>
      <td>39</td>
      <td>54</td>
      <td>53</td>
      <td>74</td>
      <td>74</td>
      <td>423</td>
      <td>435</td>
    </tr>
    <tr style="text-align: center;">
      <td>GLM4-9B+VL-9B</td>
      <td>Zhipu AI</td>
      <td>108</td>
      <td>110.5</td>
      <td>71</td>
      <td>29</td>
      <td>44</td>
      <td>55</td>
      <td>75</td>
      <td>54</td>
      <td>62</td>
      <td>417.5</td>
      <td>409.5</td>
    </tr>
    <tr style="text-align: center;">
      <td>Qwen2-57B+VL-7B</td>
      <td>Alibaba</td>
      <td>108</td>
      <td>141</td>
      <td>65</td>
      <td>6</td>
      <td>22</td>
      <td>44</td>
      <td>75</td>
      <td>77</td>
      <td>30</td>
      <td>386</td>
      <td>431</td>
    </tr>
    <tr style="text-align: center;">
      <td>Yi-34B+VL-34B</td>
      <td>01.AI</td>
      <td>109</td>
      <td>107.5</td>
      <td>39</td>
      <td>15</td>
      <td>40</td>
      <td>55.5</td>
      <td>65</td>
      <td>53</td>
      <td>54</td>
      <td>366</td>
      <td>388.5</td>
    </tr>
  </tbody>
</table>

An asterisk (\*) after a question number indicates that the question contains images. If a model name includes "+VL," questions containing images were answered by the corresponding multimodal version of the model. If "+VL" is absent, the model performed text-only inference and the images were ignored.
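
The two total columns can be recomputed from the subject columns. The sketch below assumes the composition implied by the column headers (the science total sums Chinese, English, Mathematics (Science), Physics, Chemistry, and Biology; the humanities total sums Chinese, English, Mathematics (Arts), History, and Geography) and uses two rows copied from the table above. The variable and function names are illustrative only.

```python
SCIENCE = ("Chinese", "English", "Mathematics (Science)",
           "Physics", "Chemistry", "Biology")
HUMANITIES = ("Chinese", "English", "Mathematics (Arts)",
              "History", "Geography")

# Subject scores copied from the overall score table.
scores = {
    "Qwen2-72B text only": {
        "Chinese": 128, "English": 141, "Mathematics (Science)": 89,
        "Physics": 32, "Chemistry": 48, "Biology": 50,
        "Mathematics (Arts)": 95, "History": 71, "Geography": 81,
    },
    "GPT-4o": {
        "Chinese": 122, "English": 142.5, "Mathematics (Science)": 84,
        "Physics": 31, "Chemistry": 34, "Biology": 72,
        "Mathematics (Arts)": 89, "History": 82, "Geography": 66,
    },
}


def totals(row):
    """Return (total science score, total humanities score) for one model."""
    science = sum(row[subject] for subject in SCIENCE)
    humanities = sum(row[subject] for subject in HUMANITIES)
    return science, humanities


for model, row in scores.items():
    print(model, totals(row))
```

Both rows reproduce the published totals (488 and 516 for Qwen2-72B text only, 485.5 and 501.5 for GPT-4o).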


## Chinese
The scores for each section of the test paper are as follows: 
<table border="1"> <tr style="text-align: center;"> 
    <th colspan="8" style="text-align: center;">Score Distribution for Each Question Type in Chinese</th> 
</tr> 
<tr style="text-align: center;"> 
    <td>Model</td> 
    <td>Modern Text Reading (Total Score: 36)</td> 
    <td>Classical Chinese Reading (Total Score: 19)</td> 
    <td>Ancient Poetry Reading (Total Score: 9)</td> 
    <td>Memorization of Famous Works and Quotations (Total Score: 6)</td> 
    <td>Language and Text Application (Total Score: 20)</td> 
    <td>Essay (Total Score: 60)</td> 
    <td>Total Score (Maximum: 150)</td> 
</tr> <tr style="text-align: center;"> <td>Qwen2-72B</td> <td>35</td> <td>19</td> <td>9</td> <td>2</td> <td>15</td> <td>48</td> <td>128</td> </tr> <tr style="text-align: center;"> <td>GPT-4o</td> <td>29</td> <td>19</td> <td>8</td> <td>4</td> <td>14</td> <td>48</td> <td>122</td> </tr> <tr style="text-align: center;"> <td>WQX-20B</td> <td>26</td> <td>14</td> <td>7</td> <td>6</td> <td>15</td> <td>43</td> <td>111</td> </tr> <tr style="text-align: center;"> <td>Yi-1.5-34B+VL-34B</td> <td>28</td> <td>12</td> <td>7</td> <td>0</td> <td>16</td> <td>46</td> <td>109</td> </tr> <tr style="text-align: center;"> <td>GLM4-9B+4v-9B</td> <td>24</td> <td>13</td> <td>8</td> <td>2</td> <td>15</td> <td>46</td> <td>108</td> </tr> <tr style="text-align: center;"> <td>Qwen2-57B</td> <td>27</td> <td>14</td> <td>7</td> <td>2</td> <td>14</td> <td>44</td> <td>108</td> </tr> <tr style="text-align: center;"> <td>Mixtral 8x22B</td> <td>24</td> <td>0</td> <td>7</td> <td>0</td> <td>14</td> <td>47</td> <td>92</td> </tr> </table>

The scores for each question in the test paper are as follows: 
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Chinese</th>
        <th rowspan="2">Question number</th>
<th colspan="3">Modern Text Reading I</th><th colspan="3">Modern Text Reading II</th><th colspan="3">Modern Text Reading III</th><th colspan="4">Classical Chinese Reading</th><th colspan="2">Classical Poetry and Prose Reading</th><th colspan="1">Memorization of Famous Works and Quotes</th><th colspan="4">Language and Text Application I</th><th colspan="1">Language and Text Application II</th><th colspan="1">Essay</th><th rowspan="2">Total Score</th></tr>
<tr style="text-align: center;"><th>1.1</th><th>1.2</th><th>1.3</th><th>2.1</th><th>2.2</th><th>2.3</th><th>3.1</th><th>3.2</th><th>3.3</th><th>4.1</th><th>4.2</th><th>4.3</th><th>4.4</th><th>5.1</th><th>5.2</th><th>6</th><th>7.1</th><th>7.2</th><th>7.3</th><th>7.4</th><th>7.5</th><th>8</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>6</td><td>3</td><td>6</td><td>6</td><td>3</td><td>3</td><td>3</td><td>10</td><td>3</td><td>6</td><td>6</td><td>3</td><td>4</td><td>3</td><td>4</td><td>6</td><td>60</td><td>150	(100%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>3</td><td>3</td><td>3</td><td>3</td><td>2</td><td>6</td><td>3</td><td>6</td><td>6</td><td>3</td><td>3</td><td>3</td><td>10</td><td>3</td><td>6</td><td>2</td><td>3</td><td>4</td><td>0</td><td>2</td><td>6</td><td>48</td><td>128	(85.3%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>3</td><td>3</td><td>3</td><td>3</td><td>2</td><td>4</td><td>0</td><td>5</td><td>6</td><td>3</td><td>3</td><td>3</td><td>10</td><td>3</td><td>5</td><td>4</td><td>3</td><td>2</td><td>0</td><td>4</td><td>5</td><td>48</td><td>122	(81.3%)</td></tr><tr style="text-align: center;"><td>WQX-20B</td><td></td><td>3</td><td>3</td><td>3</td><td>3</td><td>2</td><td>4</td><td>0</td><td>3</td><td>5</td><td>3</td><td>3</td><td>0</td><td>8</td><td>3</td><td>4</td><td>6</td><td>3</td><td>4</td><td>0</td><td>4</td><td>4</td><td>43</td><td>111	(74%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B</td><td></td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>3</td><td>0</td><td>5</td><td>5</td><td>1</td><td>3</td><td>0</td><td>8</td><td>3</td><td>4</td><td>0</td><td>3</td><td>4</td><td>3</td><td>0</td><td>6</td><td>46</td><td>109	(72.7%)</td></tr><tr style="text-align: 
center;"><td>GLM4-9B</td><td></td><td>3</td><td>3</td><td>3</td><td>0</td><td>2</td><td>4</td><td>0</td><td>3</td><td>6</td><td>2</td><td>3</td><td>0</td><td>8</td><td>3</td><td>5</td><td>2</td><td>3</td><td>4</td><td>0</td><td>4</td><td>4</td><td>46</td><td>108	(72%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B</td><td></td><td>3</td><td>3</td><td>3</td><td>3</td><td>2</td><td>3</td><td>3</td><td>2</td><td>5</td><td>3</td><td>0</td><td>3</td><td>8</td><td>3</td><td>4</td><td>2</td><td>3</td><td>4</td><td>0</td><td>2</td><td>5</td><td>44</td><td>108	(72%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>3</td><td>3</td><td>3</td><td>3</td><td>2</td><td>2</td><td>0</td><td>3</td><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>3</td><td>4</td><td>0</td><td>3</td><td>3</td><td>3</td><td>0</td><td>5</td><td>47</td><td>92	(61.3%)</td></tr></table>

## Mathematics (Arts)
The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="6" style="text-align: center;">Score Distribution for Each Question Type in Mathematics (Arts)</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td> <td>Single Choice Questions (Total Score: 60)</td> <td>Fill-in-the-Blank Questions (Total Score: 20)</td> <td>Short Answer Questions (Total Score: 60)</td> <td>Elective Questions - Short Answer Questions (Total Score: 20)</td> <td>Total Score (Maximum: 150)</td>
</tr>
<tr style="text-align: center;"><td>Qwen2-72B</td><td>50</td><td>15</td><td>20</td><td>14</td><td>95</td></tr>
<tr style="text-align: center;"><td>GPT-4o</td><td>40</td><td>15</td><td>24</td><td>10</td><td>89</td></tr>
<tr style="text-align: center;"><td>GLM4-9B</td><td>35</td><td>10</td><td>27</td><td>3</td><td>75</td></tr>
<tr style="text-align: center;"><td>Qwen2-57B</td><td>40</td><td>10</td><td>18</td><td>9</td><td>75</td></tr>
<tr style="text-align: center;"><td>WQX-20B</td><td>30</td><td>15</td><td>26</td><td>0</td><td>71</td></tr>
<tr style="text-align: center;"><td>Yi-1.5-34B</td><td>25</td><td>5</td><td>31</td><td>6</td><td>65</td></tr>
<tr style="text-align: center;"><td>Mixtral 8x22B</td><td>30</td><td>5</td><td>15</td><td>3</td><td>53</td></tr>
</table>

The scores for each question in the test paper are as follows:
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Mathematics (Arts)</th>
        <th rowspan="2">Question number</th>
<th colspan="12">Single Choice Questions</th><th colspan="4">Fill-in-the-Blank Questions</th><th colspan="5">Short Answer Questions</th><th colspan="2">Elective Questions - Short Answer</th><th rowspan="2">Total Score</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th><th>13</th><th>14</th><th>15</th><th>16</th><th>17</th><th>18</th><th>19</th><th>20</th><th>21</th><th>22</th><th>23</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>12</td><td>12</td><td>12</td><td>12</td><td>12</td><td>10</td><td>10</td><td>150	(100%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>9</td><td>0</td><td>9</td><td>2</td><td>10</td><td>4</td><td>95	(63.3%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>5</td><td>5</td><td>5</td><td>0</td><td>1</td><td>10</td><td>0</td><td>8</td><td>5</td><td>10</td><td>0</td><td>89	(59.3%)</td></tr><tr style="text-align: center;"><td>GLM4-9B</td><td></td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>5</td><td>5</td><td>0</td><td>0</td><td>0</td><td>10</td><td>2</td><td>7</td><td>8</td><td>0</td><td>3</td><td>75	(50%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B</td><td></td><td>5</td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>5</td><td>0</td><td>5</td><td>5</td><td>0</td><td>0</td><td>2</td><td>9</td><td>3</td><td>2</td><td>2</td><td>7</td><td>2</td><td>75	(50%)</td></tr><tr style="text-align: 
center;"><td>WQX-20B</td><td></td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>5</td><td>5</td><td>0</td><td>0</td><td>0</td><td>5</td><td>0</td><td>5</td><td>5</td><td>0</td><td>5</td><td>2</td><td>10</td><td>0</td><td>8</td><td>6</td><td>0</td><td>0</td><td>71	(47.3%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B</td><td></td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>4</td><td>9</td><td>0</td><td>12</td><td>6</td><td>4</td><td>2</td><td>65	(43.3%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>9</td><td>0</td><td>4</td><td>2</td><td>0</td><td>3</td><td>53	(35.3%)</td></tr></table>
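
The section scores in the Mathematics tables do not simply add up to the reported total: for Qwen2-72B, 50 + 15 + 20 + 14 is 99, not 95. The published totals appear consistent with counting only the higher-scoring of the two elective questions (22 and 23). That rule is an inference from the numbers, not something the scoring criteria state, and the sketch below only checks it against the Qwen2-72B row; the names are illustrative.

```python
# Per-question scores copied from the Qwen2-72B row of the Mathematics (Arts) table.
REQUIRED = [
    5, 5, 5, 5, 0, 5, 5, 0, 5, 5, 5, 5,  # single choice, questions 1-12
    5, 5, 5, 0,                          # fill-in-the-blank, questions 13-16
    0, 9, 0, 9, 2,                       # short answer, questions 17-21
]
ELECTIVES = [10, 4]                      # elective questions 22 and 23


def paper_total(required, electives):
    """Sum the required questions and add only the best elective (assumed rule)."""
    return sum(required) + max(electives)


naive_total = sum(REQUIRED) + sum(ELECTIVES)      # counts both electives
reported_total = paper_total(REQUIRED, ELECTIVES)  # counts the better elective only
print(naive_total, reported_total)
```

Under this assumption the row yields 95, matching the table, while summing both electives gives 99.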


## Mathematics (Science)
The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="6" style="text-align: center;">Score Distribution for Each Question Type in Mathematics (Science)</th>
</tr>
<tr style="text-align: center;">
     <td>Model</td> <td>Single Choice Questions (Total Score: 60)</td> <td>Fill-in-the-Blank Questions (Total Score: 20)</td> <td>Short Answer Questions (Total Score: 60)</td> <td>Elective Questions - Short Answer Questions (Total Score: 20)</td> <td>Total Score (Maximum: 150)</td>
</tr>
<tr style="text-align: center;"><td>Qwen2-72B</td><td>50</td><td>10</td><td>19</td><td>15</td><td>89</td></tr>
<tr style="text-align: center;"><td>GPT-4o</td><td>35</td><td>15</td><td>27</td><td>12</td><td>84</td></tr>
<tr style="text-align: center;"><td>WQX-20B</td><td>35</td><td>5</td><td>38</td><td>0</td><td>78</td></tr>
<tr style="text-align: center;"><td>GLM4-9B</td><td>35</td><td>5</td><td>28</td><td>3</td><td>71</td></tr>
<tr style="text-align: center;"><td>Qwen2-57B</td><td>40</td><td>5</td><td>13</td><td>13</td><td>65</td></tr>
<tr style="text-align: center;"><td>Mixtral 8x22B</td><td>30</td><td>0</td><td>21</td><td>12</td><td>58</td></tr>
<tr style="text-align: center;"><td>Yi-1.5-34B</td><td>20</td><td>0</td><td>17</td><td>2</td><td>39</td></tr>
</table>

The scores for each question in the test paper are as follows:
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Mathematics (Science)</th>
        <th rowspan="2">Question number</th>
<th colspan="12">Single Choice Questions</th><th colspan="4">Fill-in-the-Blank Questions</th><th colspan="5">Short Answer Questions</th><th colspan="2">Elective Questions - Short Answer</th><th rowspan="2">Total Score</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th><th>13</th><th>14</th><th>15</th><th>16</th><th>17</th><th>18</th><th>19</th><th>20</th><th>21</th><th>22</th><th>23</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>12</td><td>12</td><td>12</td><td>12</td><td>12</td><td>10</td><td>10</td><td>150	(100%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>5</td><td>0</td><td>4</td><td>7</td><td>0</td><td>4</td><td>4</td><td>10</td><td>5</td><td>89	(59.3%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>5</td><td>5</td><td>5</td><td>0</td><td>10</td><td>2</td><td>0</td><td>5</td><td>10</td><td>7</td><td>5</td><td>84	(56%)</td></tr><tr style="text-align: center;"><td>WQX-20B</td><td></td><td>5</td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>0</td><td>5</td><td>0</td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>10</td><td>6</td><td>4</td><td>8</td><td>10</td><td>0</td><td>0</td><td>78	(52%)</td></tr><tr style="text-align: center;"><td>GLM4-9B</td><td></td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>0</td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>4</td><td>7</td><td>0</td><td>12</td><td>5</td><td>0</td><td>3</td><td>71	(47.3%)</td></tr><tr style="text-align: 
center;"><td>Qwen2-57B</td><td></td><td>5</td><td>5</td><td>5</td><td>0</td><td>5</td><td>5</td><td>0</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>0</td><td>5</td><td>0</td><td>0</td><td>6</td><td>0</td><td>0</td><td>2</td><td>5</td><td>7</td><td>6</td><td>65	(43.3%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>5</td><td>5</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6</td><td>9</td><td>0</td><td>2</td><td>4</td><td>5</td><td>7</td><td>58	(38.7%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B</td><td></td><td>5</td><td>5</td><td>5</td><td>0</td><td>0</td><td>5</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6</td><td>1</td><td>0</td><td>6</td><td>4</td><td>2</td><td>0</td><td>39	(26%)</td></tr></table>



## English

The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;"> <th colspan="8" style="text-align: center;">Score Distribution for Each Question Type in English</th> </tr>
<tr style="text-align: center;"> <td>Model</td> <td>Reading Comprehension (Total Score: 30)</td> <td>Select 5 out of 7 (Total Score: 10)</td> <td>Cloze Test (Total Score: 30)</td> <td>Grammar Completion (Total Score: 15)</td> <td>Writing (Total Score: 35)</td> <td>Listening (Total Score: 30)</td> <td>Total Score (Maximum: 150)</td> </tr>
<tr style="text-align: center;"> <td>GPT-4o</td> <td>30</td> <td>10</td> <td>28.5</td> <td>15</td> <td>29</td> <td>30</td> <td>142.5</td> </tr>
<tr style="text-align: center;"> <td>Mixtral 8x22B</td> <td>30</td> <td>10</td> <td>30</td> <td>15</td> <td>27</td> <td>30</td> <td>142</td> </tr>
<tr style="text-align: center;"> <td>Qwen2-72B</td> <td>30</td> <td>10</td> <td>30</td> <td>15</td> <td>26</td> <td>30</td> <td>141</td> </tr>
<tr style="text-align: center;"> <td>WQX-20B</td> <td>30</td> <td>10</td> <td>28.5</td> <td>15</td> <td>27.5</td> <td>30</td> <td>141</td> </tr>
<tr style="text-align: center;"> <td>Qwen2-57B</td> <td>28</td> <td>10</td> <td>30</td> <td>15</td> <td>28</td> <td>30</td> <td>141</td> </tr>
<tr style="text-align: center;"> <td>GLM4-9B</td> <td>26</td> <td>0</td> <td>21</td> <td>12</td> <td>21.5</td> <td>30</td> <td>110.5</td> </tr>
<tr style="text-align: center;"> <td>Yi-1.5-34B</td> <td>24</td> <td>8</td> <td>16.5</td> <td>13.5</td> <td>15.5</td> <td>30</td> <td>107.5</td> </tr>
</table>

The scores for each question in the test paper are as follows: 
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">English</th>
        <th rowspan="2">Question number</th>
        <th colspan="1">Reading Comprehension A</th>
        <th colspan="1">Reading Comprehension B</th>
        <th colspan="1">Reading Comprehension C</th>
        <th colspan="1">Reading Comprehension D</th>
        <th colspan="1">Select 5 out of 7</th>
        <th colspan="1">Cloze Test</th>
        <th colspan="1">Grammar Completion</th>
        <th colspan="1">Writing - Short Composition Correction</th>
        <th colspan="1">Writing - Written Expression</th>
        <th colspan="1">Listening</th>
        <th rowspan="2">Total Score</th>
    </tr>
    <tr style="text-align: center;">
        <th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th>
    </tr>
    <tr style="text-align: center;">
        <td>Test Model</td><td>Score</td><td>6</td><td>8</td><td>8</td><td>8</td><td>10</td><td>30</td><td>15</td><td>10</td><td>25</td><td>30</td><td>150 (100%)</td>
    </tr>
    <tr style="text-align: center;"><td>GPT-4o</td><td></td><td>6</td><td>8</td><td>8</td><td>8</td><td>10</td><td>28.5</td><td>15</td><td>8</td><td>21</td><td>30</td><td>142.5 (95%)</td></tr>
    <tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>6</td><td>8</td><td>8</td><td>8</td><td>10</td><td>30</td><td>15</td><td>8</td><td>19</td><td>30</td><td>142 (94.7%)</td></tr>
    <tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>6</td><td>8</td><td>8</td><td>8</td><td>10</td><td>30</td><td>15</td><td>8</td><td>18</td><td>30</td><td>141 (94%)</td></tr>
    <tr style="text-align: center;"><td>WQX-20B</td><td></td><td>6</td><td>8</td><td>8</td><td>8</td><td>10</td><td>28.5</td><td>15</td><td>9</td><td>18.5</td><td>30</td><td>141 (94%)</td></tr>
    <tr style="text-align: center;"><td>Qwen2-57B</td><td></td><td>6</td><td>8</td><td>6</td><td>8</td><td>10</td><td>30</td><td>15</td><td>9</td><td>19</td><td>30</td><td>141 (94%)</td></tr>
    <tr style="text-align: center;"><td>GLM4-9B</td><td></td><td>6</td><td>8</td><td>6</td><td>6</td><td>0</td><td>21</td><td>12</td><td>6</td><td>15.5</td><td>30</td><td>110.5 (73.7%)</td></tr>
    <tr style="text-align: center;"><td>Yi-1.5-34B</td><td></td><td>4</td><td>8</td><td>6</td><td>6</td><td>8</td><td>16.5</td><td>13.5</td><td>5</td><td>10.5</td><td>30</td><td>107.5 (71.7%)</td></tr>
</table>

## Physics
The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="7" style="text-align: center;">Score Distribution for Each Question Type in Physics</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td>
<td>Multiple Choice Questions (Total Score: 48)</td>
<td>Fill-in-the-Blank Questions (Total Score: 15)</td>
<td>Short Answer Questions (Total Score: 32)</td>
<td>Elective Questions - Multiple Choice (Total Score: 10)</td>
<td>Elective Questions (Total Score: 20)</td>
<td>Total Score (Maximum: 110)</td></tr>
<tr style="text-align: center;"><td>Mixtral 8x22B</td><td>27</td><td>1</td><td>9</td><td>1</td><td>0</td><td>38</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B</td><td>18</td><td>1</td><td>9</td><td>0</td><td>4</td><td>32</td></tr>
<tr style="text-align: center;"><td>GPT-4o</td><td>15</td><td>5</td><td>10</td><td>1</td><td>0</td><td>31</td></tr>
<tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>24</td><td>1</td><td>4</td><td>1</td><td>0</td><td>30</td></tr>
<tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>18</td><td>2</td><td>6</td><td>2</td><td>1</td><td>29</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>12</td><td>2</td><td>8</td><td>0</td><td>0</td><td>22</td></tr>
<tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>9</td><td>0</td><td>6</td><td>0</td><td>0</td><td>15</td></tr>
<tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>0</td><td>2</td><td>4</td><td>0</td><td>0</td><td>6</td></tr>
</table>

The scores for each question in the test paper are as follows:
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Physics</th>
        <th rowspan="2">Question number</th>
<th colspan="8">Multiple Choice Questions</th><th colspan="2">Fill-in-the-Blank Questions</th><th colspan="2">Short Answer Questions</th><th colspan="2">Elective Questions - Multiple Choice</th><th colspan="2">Elective Questions</th><th rowspan="2">Total Score</th><th rowspan="2">Total Score for Questions with Diagrams</th><th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1</th><th>2*</th><th>3</th><th>4*</th><th>5*</th><th>6*</th><th>7*</th><th>8*</th><th>9*</th><th>10*</th><th>11</th><th>12*</th><th>13.1</th><th>13.2</th><th>14.1</th><th>14.2</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>5</td><td>10</td><td>12</td><td>20</td><td>5</td><td>10</td><td>5</td><td>10</td><td>110	(100%)</td><td>71	(65%)</td><td>39	(35%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>0</td><td>6</td><td>6</td><td>6</td><td>6</td><td>3</td><td>0</td><td>0</td><td>1</td><td>0</td><td>4</td><td>5</td><td>0</td><td>0</td><td>1</td><td>0</td><td>38	(34.5%)</td><td>27	(38%)</td><td>11	(28.2%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>3</td><td>3</td><td>0</td><td>1</td><td>0</td><td>6</td><td>3</td><td>1</td><td>1</td><td>0</td><td>4</td><td>33	(30%)</td><td>10	(14.1%)</td><td>23	(61.5%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>3</td><td>3</td><td>2</td><td>7</td><td>3</td><td>1</td><td>0</td><td>0</td><td>0</td><td>31	(28.2%)</td><td>11	(15.5%)</td><td>20	(51.3%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>6</td><td>0</td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>6</td><td>1</td><td>0</td><td>4</td><td>0</td><td>1</td><td>0</td><td>0</td><td>0</td><td>30	(27.3%)</td><td>13	(18.3%)</td><td>17	(43.6%)</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td></td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>3</td><td>0</td><td>3</td><td>0</td><td>2</td><td>4</td><td>2</td><td>2</td><td>1</td><td>0</td><td>0</td><td>29	(26.4%)</td><td>10	(14.1%)</td><td>19	(48.7%)</td></tr><tr style="text-align: 
center;"><td>Qwen2-72B+VL-7B</td><td></td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>8</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>22	(20%)</td><td>2	(2.8%)</td><td>20	(51.3%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>3</td><td>3</td><td>3</td><td>0</td><td>0</td><td>4</td><td>2</td><td>0</td><td>0</td><td>0</td><td>0</td><td>15	(13.6%)</td><td>11	(15.5%)</td><td>4	(10.3%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>4</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6	(5.5%)</td><td>2	(2.8%)</td><td>4	(10.3%)</td></tr></table>

An asterisk (\*) after a question number indicates that the question contains images. If a model name includes "+VL," questions containing images were answered by the corresponding multimodal version of the model. If "+VL" is absent, the model performed text-only inference and the images were ignored.
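
The "with diagrams" and "without diagrams" columns of the Physics table can be derived from the asterisks in the question header. The sketch below assumes that the unmarked elective questions (13 and 14) fall on the "without diagrams" side and that only the higher-scoring elective is counted; both assumptions are inferred from the full-marks row (71 and 39) and are not stated in the criteria. The data are the full-marks row and the Mixtral 8x22B row, and the names are illustrative.

```python
# Question labels for questions 1-12; "*" marks a question that contains images.
QUESTIONS = ["1", "2*", "3", "4*", "5*", "6*", "7*", "8*", "9*", "10*", "11", "12*"]

FULL_MARKS = [6, 6, 6, 6, 6, 6, 6, 6, 5, 10, 12, 20]
MIXTRAL = [0, 6, 6, 6, 6, 3, 0, 0, 1, 0, 4, 5]

# Elective questions as (x.1, x.2) pairs for questions 13 and 14.
FULL_ELECTIVES = [(5, 10), (5, 10)]
MIXTRAL_ELECTIVES = [(0, 0), (1, 0)]


def split_by_diagram(scores, electives):
    """Return (score on image questions, score on text-only questions)."""
    with_diagrams = sum(s for q, s in zip(QUESTIONS, scores) if q.endswith("*"))
    without_diagrams = sum(s for q, s in zip(QUESTIONS, scores) if not q.endswith("*"))
    # Assumed rule: only the better of the two electives counts, as text-only.
    without_diagrams += max(sum(parts) for parts in electives)
    return with_diagrams, without_diagrams


full_split = split_by_diagram(FULL_MARKS, FULL_ELECTIVES)
mixtral_split = split_by_diagram(MIXTRAL, MIXTRAL_ELECTIVES)
print(full_split, mixtral_split)
```

Under these assumptions the full-marks row splits into 71 and 39 and the Mixtral 8x22B row into 27 and 11, matching the table. Not every row reproduces this way: for example, the Qwen2-72B per-question row lists a total of 33, while the section table and the overall table list 32.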

## Chemistry

The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="5" style="text-align: center;">Score Distribution for Each Question Type in Chemistry</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td> <td>Multiple Choice Questions (Total Score: 42)</td><td>Fill-in-the-Blank Questions (Total Score: 43)</td><td>Elective Questions - Fill-in-the-Blank Questions (Total Score: 30)</td><td>Total Score (Maximum: 100)</td>
</tr>
<tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>30</td><td>15</td><td>10</td><td>52</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B</td><td>24</td><td>13</td><td>13</td><td>48</td></tr>
<tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>24</td><td>15</td><td>7</td><td>44</td></tr>
<tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>24</td><td>13</td><td>4</td><td>40</td></tr>
<tr style="text-align: center;"><td>Mixtral 8x22B</td><td>24</td><td>8</td><td>7</td><td>39</td></tr>
<tr style="text-align: center;"><td>GPT-4o</td><td>12</td><td>14</td><td>8</td><td>34</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>12</td><td>7</td><td>5</td><td>22</td></tr>
<tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>12</td><td>7</td><td>5</td><td>22</td></tr>
</table>

The scores for each question in the test paper are as follows:
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Chemistry</th>
        <th rowspan="2">Question number</th>
<th colspan="7">Multiple Choice Questions</th><th colspan="3">Fill-in-the-Blank Questions</th><th colspan="2">Elective Questions - Fill-in-the-Blank Questions</th><th rowspan="2">Total Score</th><th rowspan="2">Total Score for Questions with Diagrams</th><th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3*</th><th>4*</th><th>5</th><th>6*</th><th>7*</th><th>8*</th><th>9*</th><th>10*</th><th>11*</th><th>12*</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>14</td><td>14</td><td>15</td><td>15</td><td>15</td><td>100	(100%)</td><td>82	(82%)</td><td>18	(18%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>6</td><td>6</td><td>0</td><td>4</td><td>8</td><td>3</td><td>7</td><td>3</td><td>52	(52%)</td><td>37	(45.1%)</td><td>15	(100%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>6</td><td>0</td><td>3</td><td>8</td><td>2</td><td>11</td><td>2</td><td>48	(48%)</td><td>38	(46.3%)</td><td>10	(66.7%)</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>6</td><td>0</td><td>0</td><td>5</td><td>7</td><td>3</td><td>5</td><td>2</td><td>44	(44%)</td><td>28	(34.1%)</td><td>16	(100%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>6</td><td>0</td><td>3</td><td>5</td><td>5</td><td>3</td><td>1</td><td>40	(40%)</td><td>29	(35.4%)</td><td>11	(66.7%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>6</td><td>0</td><td>2</td><td>2</td><td>4</td><td>7</td><td>0</td><td>39	(39%)</td><td>27	(32.9%)</td><td>12	(66.7%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>6</td><td>5</td><td>3</td><td>8</td><td>0</td><td>34	(34%)</td><td>22	(26.8%)</td><td>12	(66.7%)</td></tr><tr style="text-align: 
center;"><td>Qwen2-72B+VL-7B</td><td></td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>3</td><td>2</td><td>3</td><td>2</td><td>22	(22%)</td><td>12	(14.6%)</td><td>10	(66.7%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>2</td><td>3</td><td>2</td><td>3</td><td>2</td><td>22	(22%)</td><td>12	(14.6%)</td><td>10	(66.7%)</td></tr></table>


An asterisk (\*) after a question number indicates that the question contains images. A model name that includes "+VL" means that questions involving images were answered with the corresponding multimodal version of the model; if "+VL" is absent, only text-based inference was performed and images were ignored.

## Biology

The scores for each section of the test paper are as follows: 

<table border="1">
<tr style="text-align: center;">
    <th colspan="5" style="text-align: center;">Score Distribution for Each Question Type in Biology</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td>
<td>Multiple Choice Questions (Full Score: 36)</td>
<td>Fill-in-the-Blank Questions (Full Score: 39)</td>
<td>Elective Questions - Fill-in-the-Blank Questions (Full Score: 30)</td>
<td>Total Score (Full Score: 90)</td></tr>
<tr style="text-align: center;"><td>GPT-4o</td><td>30</td><td>27</td><td>23</td><td>72</td></tr>
<tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>30</td><td>10.5</td><td>26</td><td>55.5</td></tr>
<tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>24</td><td>16</td><td>19</td><td>55</td></tr>
<tr style="text-align: center;"><td>Mixtral 8x22B</td><td>18</td><td>21</td><td>24</td><td>54</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>18</td><td>17</td><td>15</td><td>50</td></tr>
<tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>18</td><td>21</td><td>21</td><td>50</td></tr>
<tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>18</td><td>11</td><td>15</td><td>44</td></tr>
</table>

The scores for each question in the test paper are as follows: 
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Biology</th>
        <th rowspan="2">Question number</th>
<th colspan="6">Multiple Choice Questions</th><th colspan="4">Fill-in-the-Blank Questions</th><th colspan="2">Elective Questions - Fill-in-the-Blank Questions</th><th rowspan="2">Total Score</th><th rowspan="2">Total Score for Questions with Diagrams</th><th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3</th><th>4*</th><th>5</th><th>6*</th><th>7</th><th>8*</th><th>9*</th><th>10</th><th>11</th><th>12*</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>6</td><td>10</td><td>10</td><td>9</td><td>10</td><td>15</td><td>15</td><td>90	(100%)</td><td>31	(34%)</td><td>59	(66%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>6</td><td>6</td><td>6</td><td>6</td><td>0</td><td>6</td><td>8</td><td>4</td><td>5</td><td>10</td><td>15</td><td>8</td><td>72	(80%)</td><td>29	(93.5%)</td><td>43	(86.4%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>6</td><td>6</td><td>6</td><td>6</td><td>0</td><td>6</td><td>4</td><td>4.5</td><td>2</td><td>0</td><td>15</td><td>11</td><td>55.5	(61.7%)</td><td>29.5	(95.2%)</td><td>26	(62.7%)</td></tr><tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>6</td><td>4</td><td>3</td><td>3</td><td>6</td><td>15</td><td>4</td><td>55	(61.1%)</td><td>16	(51.6%)</td><td>39	(72.9%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>0</td><td>6</td><td>6</td><td>0</td><td>0</td><td>6</td><td>6</td><td>4</td><td>5</td><td>6</td><td>15</td><td>9</td><td>54	(60%)</td><td>24	(77.4%)</td><td>30	(66.1%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>8</td><td>3</td><td>0</td><td>6</td><td>15</td><td>0</td><td>50	(55.6%)</td><td>3	(9.7%)</td><td>47	(79.7%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>4</td><td>8</td><td>5</td><td>4</td><td>11</td><td>10</td><td>50	(55.6%)</td><td>23	(74.2%)</td><td>27	(62.7%)</td></tr><tr style="text-align: 
center;"><td>Qwen2-57B+VL-7B</td><td></td><td>6</td><td>6</td><td>6</td><td>0</td><td>0</td><td>0</td><td>4</td><td>3</td><td>0</td><td>4</td><td>15</td><td>0</td><td>44	(48.9%)</td><td>3	(9.7%)</td><td>41	(69.5%)</td></tr></table>

An asterisk (\*) after a question number indicates that the question contains images. A model name that includes "+VL" means that questions involving images were answered with the corresponding multimodal version of the model; if "+VL" is absent, only text-based inference was performed and images were ignored.

## History

The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="4" style="text-align: center;">Score Distribution for Each Question Type in History</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td>  
<td>Multiple Choice Questions (Full Score: 48)</td><td>Short Answer Questions (Full Score: 52)</td><td>Total Score (Full Score: 100)</td></tr>
<tr style="text-align: center;"><td>GPT-4o</td><td>36</td><td>46</td><td>82</td></tr>
<tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td>40</td><td>37</td><td>77</td></tr>
<tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td>40</td><td>36</td><td>76</td></tr>
<tr style="text-align: center;"><td>Mixtral 8x22B</td><td>36</td><td>38</td><td>74</td></tr>
<tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td>32</td><td>39</td><td>71</td></tr>
<tr style="text-align: center;"><td>GLM4-9B+4v-9B</td><td>20</td><td>34</td><td>54</td></tr>
<tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td>20</td><td>33</td><td>53</td></tr>
</table>

The scores for each question in the test paper are as follows: 
<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">History</th>
        <th rowspan="2">Question number</th>
<th colspan="12">Multiple Choice Questions</th><th colspan="3">Short Answer Questions</th><th rowspan="2">Total Score</th><th rowspan="2">Total Score for Questions with Diagrams</th><th rowspan="2">Total Score for Questions without Diagrams</th></tr>
<tr style="text-align: center;"><th>1</th><th>2</th><th>3</th><th>4</th><th>5</th><th>6</th><th>7</th><th>8</th><th>9</th><th>10</th><th>11</th><th>12</th><th>13</th><th>14*</th><th>15</th></tr><tr style="text-align: center;"><td>Test Model</td><td>Score</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>25</td><td>12</td><td>15</td><td>100	(100%)</td><td>12	(12%)</td><td>88	(88%)</td></tr><tr style="text-align: center;"><td>GPT-4o</td><td></td><td>4</td><td>0</td><td>0</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>23</td><td>10</td><td>13</td><td>82	(82%)</td><td>10	(83.3%)</td><td>72	(81.8%)</td></tr><tr style="text-align: center;"><td>Qwen2-57B+VL-7B</td><td></td><td>4</td><td>4</td><td>0</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>21</td><td>4</td><td>12</td><td>77	(77%)</td><td>4	(33.3%)</td><td>73	(83%)</td></tr><tr style="text-align: center;"><td>WQX-20B+VL-20B</td><td></td><td>4</td><td>0</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>17</td><td>8</td><td>11</td><td>76	(76%)</td><td>8	(66.7%)</td><td>68	(77.3%)</td></tr><tr style="text-align: center;"><td>Mixtral 8x22B</td><td></td><td>0</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>22</td><td>7</td><td>9</td><td>74	(74%)</td><td>7	(58.3%)</td><td>67	(76.1%)</td></tr><tr style="text-align: center;"><td>Qwen2-72B+VL-7B</td><td></td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>21</td><td>4</td><td>14</td><td>71	(71%)</td><td>4	(33.3%)</td><td>67	(76.1%)</td></tr><tr style="text-align: 
center;"><td>GLM4-9B+4v-9B</td><td></td><td>4</td><td>0</td><td>0</td><td>0</td><td>4</td><td>0</td><td>0</td><td>4</td><td>0</td><td>0</td><td>4</td><td>4</td><td>17</td><td>4</td><td>13</td><td>54	(54%)</td><td>4	(33.3%)</td><td>50	(56.8%)</td></tr><tr style="text-align: center;"><td>Yi-1.5-34B+VL-34B</td><td></td><td>4</td><td>0</td><td>0</td><td>0</td><td>0</td><td>0</td><td>4</td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td><td>20</td><td>4</td><td>9</td><td>53	(53%)</td><td>4	(33.3%)</td><td>49	(55.7%)</td></tr></table>

An asterisk (\*) after a question number indicates that the question contains images. A model name that includes "+VL" means that questions involving images were answered with the corresponding multimodal version of the model; if "+VL" is absent, only text-based inference was performed and images were ignored.
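As a rough illustration of how the last three columns of the History per-question table relate to the individual question scores, the sketch below recomputes them for the GPT-4o row. The function name `split_totals` and the data layout are hypothetical and chosen only for this example; the numbers are copied from the table above, and rounding to one decimal place is an assumption that happens to match the reported percentages.

```python
# Per-question data for the History paper, copied from the table above.
FULL_MARKS = [4] * 12 + [25, 12, 15]   # full marks for questions 1-15
GPT4O_SCORES = [4, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 23, 10, 13]
DIAGRAM_QUESTIONS = {14}               # 1-based numbers of starred questions


def split_totals(scores, full_marks, diagram_questions):
    """Return (score, full marks, percent) overall and split by diagram use."""

    def summarize(indices):
        got = sum(scores[i] for i in indices)
        full = sum(full_marks[i] for i in indices)
        # Percent is relative to the full marks of the same question group.
        return got, full, round(100 * got / full, 1)

    all_idx = list(range(len(scores)))
    with_idx = [i for i in all_idx if i + 1 in diagram_questions]
    without_idx = [i for i in all_idx if i + 1 not in diagram_questions]
    return {
        "total": summarize(all_idx),
        "with_diagrams": summarize(with_idx),
        "without_diagrams": summarize(without_idx),
    }


result = split_totals(GPT4O_SCORES, FULL_MARKS, DIAGRAM_QUESTIONS)
for name, (got, full, pct) in result.items():
    print(f"{name}: {got}/{full} ({pct}%)")
```

Under these assumptions, the percentage in each column is taken against the full marks of that question group (12 points with diagrams, 88 without), not against the 100-point paper total.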


## Geography

The scores for each section of the test paper are as follows: 
<table border="1">
<tr style="text-align: center;">
    <th colspan="5" style="text-align: center;">Score Distribution for Each Question Type in Geography</th>
</tr>
<tr style="text-align: center;">
    <td>Model</td>
    <td>Multiple Choice Questions (Full Score: 44)</td>
    <td>Short Answer Questions (Full Score: 46)</td>
    <td>Elective Questions - Short Answer Questions (Full Score: 10)</td>
    <td>Total Score (Full Score: 100)</td>
</tr>
<tr style="text-align: center;">
    <td>Qwen2-72B</td>
    <td>40</td>
    <td>31</td>
    <td>10</td>
    <td>81</td>
</tr>
<tr style="text-align: center;">
    <td>Mixtral 8x22B</td>
    <td>36</td>
    <td>30</td>
    <td>8</td>
    <td>74</td>
</tr>
<tr style="text-align: center;">
    <td>GPT-4o</td>
    <td>32</td>
    <td>24</td>
    <td>10</td>
    <td>66</td>
</tr>
<tr style="text-align: center;">
    <td>WQX-20B+VL-20B</td>
    <td>24</td>
    <td>36</td>
    <td>4</td>
    <td>64</td>
</tr>
<tr style="text-align: center;">
    <td>GLM4-9B+4v-9B</td>
    <td>24</td>
    <td>28</td>
    <td>10</td>
    <td>62</td>
</tr>
<tr style="text-align: center;">
    <td>Yi-1.5-34B+VL-34B</td>
    <td>28</td>
    <td>16</td>
    <td>10</td>
    <td>54</td>
</tr>
<tr style="text-align: center;">
    <td>Qwen2-72B+VL-7B</td>
    <td>24</td>
    <td>0</td>
    <td>10</td>
    <td>34</td>
</tr>
<tr style="text-align: center;">
    <td>Qwen2-57B+VL-7B</td>
    <td>16</td>
    <td>0</td>
    <td>14</td>
    <td>30</td>
</tr>
</table>


The scores for each question in the test paper are as follows: 

<table border="1">
    <tr style="text-align: center;">
        <th rowspan="2">Geography</th>
        <th rowspan="2">Question number</th>
        <th colspan="11">Multiple Choice Questions</th>
        <th colspan="8">Short Answer Questions</th>
        <th colspan="2">Elective Questions - Short Answer Questions</th>
        <th rowspan="2">Total Score</th>
        <th rowspan="2">Total Score for Questions with Diagrams</th>
        <th rowspan="2">Total Score for Questions without Diagrams</th>
    </tr>
    <tr style="text-align: center;">
        <th>1*</th><th>2*</th><th>3*</th><th>4</th><th>5</th><th>6*</th>
        <th>7*</th><th>8*</th><th>9*</th><th>10*</th><th>11*</th>
        <th>12.1*</th><th>12.2*</th><th>12.3*</th><th>12.4*</th>
        <th>13.1*</th><th>13.2*</th><th>13.3*</th><th>13.4*</th>
        <th>14</th><th>15</th>
    </tr>
    <tr style="text-align: center;">
        <td>Test Model</td><td>Score</td>
        <td>4</td><td>4</td><td>4</td><td>4</td><td>4</td><td>4</td>
        <td>4</td><td>4</td><td>4</td><td>4</td><td>4</td>
        <td>6</td><td>6</td><td>6</td><td>6</td><td>6</td>
        <td>4</td><td>8</td><td>4</td><td>10</td><td>10</td>
        <td>100 (100%)</td><td>82 (82%)</td><td>18 (18%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Qwen2-72B</td><td></td>
        <td>4</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td>
        <td>4</td><td>4</td><td>4</td><td>4</td><td>4</td>
        <td>6</td><td>2</td><td>6</td><td>6</td><td>1</td>
        <td>2</td><td>4</td><td>4</td><td>10</td><td>10</td>
        <td>81 (81%)</td><td>63 (76.8%)</td><td>18 (100%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Mixtral 8x22B</td><td></td>
        <td>4</td><td>4</td><td>0</td><td>4</td><td>0</td><td>4</td>
        <td>4</td><td>4</td><td>4</td><td>4</td><td>4</td>
        <td>6</td><td>2</td><td>2</td><td>6</td><td>6</td>
        <td>0</td><td>4</td><td>4</td><td>5</td><td>8</td>
        <td>74 (74%)</td><td>62 (75.6%)</td><td>12 (66.7%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>GPT-4o</td><td></td>
        <td>0</td><td>4</td><td>0</td><td>4</td><td>4</td><td>4</td>
        <td>0</td><td>4</td><td>4</td><td>4</td><td>4</td>
        <td>0</td><td>2</td><td>4</td><td>6</td><td>0</td>
        <td>4</td><td>4</td><td>4</td><td>10</td><td>10</td>
        <td>66 (66%)</td><td>48 (58.5%)</td><td>18 (100%)</td>
    </tr>
    <tr style="text-align: center;">
      <td>WQX-20B+VL-20B</td><td></td>
      <td>0</td><td>0</td><td>0</td><td>4</td><td>0</td><td>0</td>
      <td>4</td><td>4</td><td>4</td><td>4</td><td>4</td>
      <td>4</td><td>2</td><td>6</td><td>6</td><td>6</td>
      <td>4</td><td>4</td><td>4</td><td>0</td><td>4</td>
      <td>64 (64%)</td><td>56 (68.3%)</td><td>8 (42.1%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>GLM4-9B+4v-9B</td><td></td>
        <td>4</td><td>4</td><td>0</td><td>4</td><td>0</td><td>0</td>
        <td>4</td><td>4</td><td>4</td><td>0</td><td>0</td>
        <td>4</td><td>2</td><td>6</td><td>6</td><td>2</td>
        <td>4</td><td>2</td><td>2</td><td>10</td><td>10</td>
        <td>62 (62%)</td><td>48 (58.5%)</td><td>14 (77.8%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Yi-1.5-34B+VL-34B</td><td></td>
        <td>4</td><td>0</td><td>0</td><td>0</td><td>0</td><td>4</td>
        <td>4</td><td>4</td><td>4</td><td>4</td><td>4</td>
        <td>4</td><td>2</td><td>4</td><td>6</td><td>0</td>
        <td>0</td><td>0</td><td>0</td><td>10</td><td>10</td>
        <td>54 (54%)</td><td>44 (53.7%)</td><td>10 (55.6%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Qwen2-72B+VL-7B</td><td></td>
        <td>0</td><td>0</td><td>0</td><td>4</td><td>4</td><td>0</td>
        <td>4</td><td>4</td><td>0</td><td>4</td><td>4</td>
        <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
        <td>0</td><td>0</td><td>0</td><td>10</td><td>10</td>
        <td>34 (34%)</td><td>16 (19.5%)</td><td>18 (100%)</td>
    </tr>
    <tr style="text-align: center;">
        <td>Qwen2-57B+VL-7B</td><td></td>
        <td>0</td><td>0</td><td>0</td><td>4</td><td>0</td><td>0</td>
        <td>4</td><td>4</td><td>0</td><td>4</td><td>4</td>
        <td>0</td><td>0</td><td>0</td><td>0</td><td>0</td>
        <td>0</td><td>0</td><td>0</td><td>10</td><td>10</td>
        <td>30 (30%)</td><td>16 (19.5%)</td><td>14 (77.8%)</td>
    </tr>
</table>

An asterisk (\*) after a question number indicates that the question contains images. A model name that includes "+VL" means that questions involving images were answered with the corresponding multimodal version of the model; if "+VL" is absent, only text-based inference was performed and images were ignored.

## Overall Feedback from Teachers

After grading all subjects, we informed the teachers that the answers to the above examination papers were generated by large models. We then invited the grading teachers to provide feedback on the overall performance of the seven large models.

**Comments from the Chinese Teacher**:  
The large models perform adequately in translating classical Chinese texts; however, they largely fail on subjective questions, struggling to understand the prompts and often misinterpreting the referents of pronouns, which leads to irrelevant answers. The essays generated by the models do not resemble typical Gaokao essays and read more like Q&A responses. Although they are to the point, they lack embellishment. Human candidates typically use examples and citations, often quoting famous individuals and drawing on material about notable figures, but the models rarely do so. When asked to write a metaphorical sentence, the models almost universally failed, confusing the tenor and the vehicle, which indicates that they understand neither the "metaphor" technique nor what a "tenor" is. They also struggled with sentence completion tasks, showing difficulty in maintaining coherence with the context and in following certain linguistic conventions of Chinese. For instance, if a new concept such as "sleep quality" appears later in the text, it should also be introduced in the completed sentence; otherwise, its sudden appearance feels abrupt and disconnected. The models also struggle to grasp subtleties in language.

**Comments from the Liberal Arts Mathematics Teacher**:  
Most objective questions were analyzed correctly, although in a small number of cases the analysis did not match the answer choice selected, leading to incorrect options. For subjective questions, the models often failed to address the second part of the question, and their responses consisted mainly of analysis that lacked depth. Errors occurred during the problem-solving process, with repeated code segments; for example, in Question 17, most could derive \( a_n \), but the subsequent content differed significantly from human responses. In Question 18, while many correctly set up the equation for \( K \), the calculation results were incorrect. In Question 19, some responses fabricated known conditions and lacked specific written content. In the later geometry questions, the models exhibited glaring issues when reasoning about perpendicular and parallel relationships, leading to absurd conclusions. In the inequality proofs, they added their own known conditions and attempted to prove the result from those. Overall, the answers to subjective questions lacked logical reasoning.

**Comments from the Science Mathematics Teacher**:  
The models generally exhibited a mechanical approach to problem-solving, with many unable to reach correct conclusions through normal reasoning processes. For instance, in the first fill-in-the-blank question, the models could only complete a small portion of the process and arrive at a result, lacking the comprehensive analysis and complete calculation steps typical of human candidates. Additionally, for geometry questions, the models presented absurd proof processes for plane geometry and did not employ standard calculation methods for solid geometry. While the models demonstrated good memorization of basic formulas, they struggled with flexible application. Some questions had correct results but flawed logical processes, making grading challenging.

**Comments from the English Teacher**:  
In terms of completion, the models generally met the requirements of the questions. However, issues were unavoidable; for instance, with lengthy questions, the models sometimes failed to identify what was being asked, leaving questions unanswered. Furthermore, some responses did not adhere to the requirements, such as failing to specify the title and opening sentence in essays or not indicating errors during correction tasks, instead presenting the revised text directly. During grading, it was noted that the models' analyses of questions often differed from the typical thought processes of human candidates, and their language was filled with clichés and overly standardized formats, making the models' outputs easy to recognize.

**Comments from the Politics Teacher**:  
Overall, the models had a low accuracy rate for multiple-choice questions, and their responses to short answer questions were overly mechanical. Particularly concerning was the first short answer question about the principal body of the National People's Congress, which none of the models answered correctly, failing to reference textbook knowledge. They did not integrate textbook concepts, mechanically repeating material without connecting to theoretical knowledge. Additionally, the models struggled to accurately interpret questions, which is a common issue across all models. They often failed to analyze the angle of the questions, such as identifying whether they were asking about significance, reasons, or measures, resulting in insufficiently standardized responses. The only questions that scored relatively well were those requiring mechanical knowledge, such as the question on dialectical thinking, which scored well because it fell within a narrow knowledge range and was considered an easy question in the exam.

**Comments from the History Teacher**:  
The exam emphasizes material analysis, grounding itself in textbooks while assessing abilities, embodying the idea that "the questions are outside the book, but the answers are within the materials." It stresses the memorization of significant events and the understanding of historical phenomena and their corresponding conclusions. The assessment focuses on core knowledge and fundamental skills, with approximately 50% of the content at the memorization level and the other 50% at the understanding and memorization level. The knowledge coverage is broad, including detailed assessments of certain nuanced issues from the textbooks.  
Issues with responses include:  
- A good grasp of basic knowledge but a lack of analytical ability regarding effective information, leading to poor comprehension of questions and an inability to flexibly apply learned knowledge to solve related issues. There is a pressing need to improve response habits and methods, and language expression remains poor, with a tendency toward colloquialism.
- The models demonstrate poor understanding of questions, needing improvement in reading comprehension, particularly in extracting effective information from materials as answers, which hinders their ability to grasp key assessment points.
- The response format is lacking; short answer questions are written as small essays, with no habit of answering in bullet points.
- Basic knowledge from textbooks is not firmly grasped, and memory is often inaccurate.
- The thought process in answering questions is unclear, lacking close adherence to material analysis, and irrelevant information is sometimes included. Questions 15 and 17 were answered relatively well, generally scoring high, while small essays were poorly executed, indicating a lack of careful reading of questions. An answer should address the question first and then elaborate, but the responses often fail to state their viewpoints clearly. There are also significant formatting issues, with word counts either insufficient or excessive.

**Comments from the Geography Teacher**:  
The large models demonstrated comprehensive coverage of geographical knowledge in their responses, addressing topics from physical geography to human geography, and from geographical phenomena to geographical principles. They performed particularly well in assessing foundational knowledge. However, when faced with questions requiring in-depth analysis or reasoning, they exhibited certain deviations and omissions, resulting in poor performance on unconventional and more open-ended questions.

**Comments from the Physics Teacher**:  
Overall, the models felt quite mechanical, often failing to grasp the meaning of questions. Some multiple-choice questions were answered correctly, but the analysis was still flawed. Errors were particularly prevalent in reading comprehension questions, where the provided answers diverged significantly from the correct ones. Some larger questions had convoluted steps lacking logical coherence, often involving circular reasoning where the conclusion was used as evidence for itself, making no sense. Additionally, there were issues with the standardization of steps, frequently skipping steps in the process. Several questions lacked specific data, with only analytical processes presented without solutions, despite the common practice in high school physics of expressing results using variables. There were instances where no answer was selected for multiple-choice questions, and in the experimental reading section, all responses were based on assumed unknowns without specific values. These are all fundamental errors that would not typically occur with students.

**Comments from the Chemistry Teacher**:  
Overall, the models exhibited a low accuracy rate. In multiple-choice questions, there were issues with incomplete recognition of questions, particularly with the last four logic-heavy questions that could not be answered correctly. In fill-in-the-blank questions, the models rarely pinpointed the correct scoring points, and the writing of chemical equations lacked accuracy and often included garbled text. They demonstrated a lack of logical reasoning abilities and struggled with topics involving organic and inorganic elements that required strong logical connections.

**Comments from the Biology Teacher**:  
The models generally performed poorly on objective multiple-choice questions that included diagrams, with some single-choice questions misidentified as multiple-choice. In subjective questions, particularly those requiring calculations of genotype numbers, the models frequently made errors and failed to fully understand the question prompts. They exhibited a higher error rate on questions involving diagrams, with some answers containing garbled text and failing to list the multiple items required by the questions.